First, we load the libraries required to perform the current analysis.
library(tidyverse)
library(naniar)
library(bookdown)
library(stringr)
library(stringi)
library(lubridate)
library(DT)
library(forcats)
library(ggthemes)
library(corrplot)
library(mltools)
library(data.table)
library(visdat)
library(janitor)
library(cowplot)
library(caTools)
library(pscl)
library(ROCR)
library(caret)
library(xgboost)
library(randomForest)
library(lightgbm)
library(Matrix)
library(catboost)
library(magrittr)
library(fmsb)
library(plotly)
library(TTR)
library(broom)
What are we trying to study?
Time series forecasting is a powerful analytical technique used to predict future values based on historical data. It plays a crucial role in various domains such as finance, economics, weather forecasting, supply chain management, and more. By analyzing patterns, trends, and dependencies within a time series dataset, forecasting models aim to provide accurate predictions and insights into future behavior.
The first step in time series forecasting is to understand the characteristics of the data. Time series data consists of a sequence of observations collected over time, where each observation is associated with a specific timestamp. These observations may exhibit trends, seasonality, cyclic patterns, or irregularities, which need to be identified and accounted for in the forecasting process.
- The internet
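The trend, seasonal, and irregular components mentioned above can be made explicit with a classical decomposition. As an illustrative sketch on R's built-in AirPassengers series (not our sales data), using the base `stl()` function:

```r
# Illustrative only: decompose a built-in monthly series into
# trend, seasonal, and remainder components with STL.
data("AirPassengers")
fit <- stl(log(AirPassengers), s.window = "periodic")
plot(fit)
```

The log transform stabilises the growing seasonal amplitude; our own series will be inspected with simpler tools below.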
Great! We have all the libraries loaded. Next, we are going to load the dataset required for the sales forecasting analysis.
We will use one dataset for exploratory data analysis and for training the prediction model, while the test dataset will be used to evaluate the model on completely new data.
After reading the data, let us see what the train dataset looks like.
df_train <- read_csv("data/train.csv")
df_test <- read_csv("data/test.csv")
head(df_train)
## # A tibble: 6 × 6
## id date country store product num_sold
## <dbl> <date> <chr> <chr> <chr> <dbl>
## 1 0 2017-01-01 Argentina Kaggle Learn Using LLMs to Improve Your C… 63
## 2 1 2017-01-01 Argentina Kaggle Learn Using LLMs to Train More LLMs 66
## 3 2 2017-01-01 Argentina Kaggle Learn Using LLMs to Win Friends an… 9
## 4 3 2017-01-01 Argentina Kaggle Learn Using LLMs to Win More Kaggl… 59
## 5 4 2017-01-01 Argentina Kaggle Learn Using LLMs to Write Better 49
## 6 5 2017-01-01 Argentina Kaggle Store Using LLMs to Improve Your C… 88
We observe that the dataset is fairly simple and concise. We will retain all the available features except for the “id” column, which serves only as a row identifier.
df_train <- df_train %>% select(-id)
In this step, we will try to check for the presence of null values in the dataset.
gg_miss_var(df_train)
Figure 3.1: Missingness in the dataset
Based on figure 3.1, we can observe that
✅ The dataset does not contain any missing values. This indicates that we have a clean dataset which is ready for EDA and further analysis.
In this section, we will try to visualise the various features and try to obtain key insights through the usage of these visualisations.
Let us try to observe the number of product sales for each country.
df_sales_count <- df_train %>% group_by(country) %>% summarise(count = n())
pl1 <- ggplot(data = df_sales_count, aes(x = country, y = count, fill = country)) +
  geom_col(color = "black") +
  geom_label(aes(label = count)) +
  theme_classic() +
  labs(x = "Country", y = "Number of products sold") +
  ggtitle("Country wise distribution of sales") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
pl1
Figure 4.1: Country wise distribution of sales
Based on figure 4.1, we can observe that,
💡 the dataset contains an equal number of sales records for each country. This is ideal for creating our prediction model, as the model can be trained without any bias originating from heterogeneous data. 💡
df_date_sale <- df_train %>% group_by(date) %>% summarise(tot_sold = sum(num_sold))
pl2 <- ggplot(data = df_date_sale,aes(x = date,y = tot_sold),group = date) + geom_line(color = 'blue') + theme_classic() + ggtitle("Total sales globally") + labs(y = "Total sales",x = "Date of purchase") + theme(plot.title = element_text(hjust = 0.5)) +
annotate("segment",x = ymd(20200101),
y = 5500,xend = ymd(20200401) ,
yend = 8000 ,arrow = arrow(type = "closed",
length = unit(0.02, "npc"))
) +
annotate("text",x = ymd(20200101),
y = 5000,colour = "red",
label = 'Dip in total sales',
size = unit(3, "pt"))
pl2
Figure 4.2: Total courses sold
While we have observed the total global sales in section 4.2, let us observe the overall trend line using a simple moving average function.
# SMA() expects a univariate series, so we pass the sales column only
df_date_sale_sma <- SMA(df_date_sale$tot_sold, n = 7)
plot.ts(df_date_sale_sma)
title("Trend line of global sales \n with 1 week moving average")
Based on figure 4.2, we can observe that
💡 there is strong seasonality in the data: sales peak around the new year every year. However, an unexpected drop in sales was observed in 2020, which may have been a result of COVID-19 restrictions. 💡
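One rough way to double-check the new-year peak is to average total daily sales by calendar month. This sketch assumes the `df_date_sale` tibble from above and the `lubridate` package already loaded:

```r
# Average daily global sales by calendar month; a December/January
# peak would support the new-year seasonality reading.
df_date_sale %>%
  mutate(month = month(date, label = TRUE)) %>%
  group_by(month) %>%
  summarise(avg_sold = mean(tot_sold)) %>%
  arrange(desc(avg_sold))
```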
df_date_sale_country <- df_train %>% group_by(date,country) %>% summarise(tot_sold = sum(num_sold))
pl4 <-ggplot(data = df_date_sale_country,
aes(x = date, y = tot_sold, color = country),
group = date) + geom_line() + theme_classic() + ggtitle("Total sales in all countries") + labs(y = "Total sales", x = "Date of purchase", color =
"Country") + theme(plot.title = element_text(hjust = 0.5))
pl4
Figure 4.3: Total courses sold in each country
Based on figure 4.3, we can observe that
💡 the strong seasonality is observed equally in each of the 5 countries, with peaks and troughs appearing around the same time of the year for all of them. Sales were highest for Canada, followed by Japan, Spain, Estonia and Argentina. Sales were particularly underwhelming in Argentina. 💡
Let us observe the product wise sales in the following visualisation.
df_prod_sale <- df_train %>% group_by(date,product) %>% summarise(tot_sold = sum(num_sold))
pl5 <-ggplot(data = df_prod_sale,
aes(x = date, y = tot_sold, color = product),
group = date) + geom_line(alpha = 0.7) + theme_classic() + ggtitle("Total sales of products in all countries") + labs(y = "Total sales", x = "Date of purchase", color =
"Product") + theme(legend.position = 'none')
ggplotly(pl5)
Figure 4.4: Product wise sales globally
Based on figure 4.4, we can observe that
💡 there is a sinusoidal seasonality in the sales of most Kaggle products. However, the product “Using LLMs to Win Friends and Influence People” does not show much seasonality and has much lower sales than the rest of the products. 💡
After observing the sales in terms of products and location, let us check how the sales fare for each Kaggle store.
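To put a rough number on this, one could compare the variability of each product's daily sales: a lower coefficient of variation suggests a flatter, less seasonal series. This is a sketch assuming the `df_prod_sale` tibble from above:

```r
# Coefficient of variation (sd / mean) of daily global sales per product;
# the least seasonal product should sit at the top of this ordering.
df_prod_sale %>%
  group_by(product) %>%
  summarise(mean_sold = mean(tot_sold),
            cv = sd(tot_sold) / mean_sold) %>%
  arrange(cv)
```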
df_store_sale <- df_train %>% group_by(date,store) %>% summarise(tot_sold = sum(num_sold))
pl6 <-ggplot(data = df_store_sale,
aes(x = date, y = tot_sold, color = store),
group = date) + geom_line(alpha = 0.7) + theme_classic() + ggtitle("Total sales of each store") + labs(y = "Total sales", x = "Date of purchase", color =
"Store") + theme(plot.title = element_text(hjust = 0.5))
ggplotly(pl6)
Figure 4.5: Total sales of each store
Upon analysing figure 4.5, we can observe that
💡 the seasonal peaks in the sales of each Kaggle store are closely synchronised with each other. However, there is a distinct difference in the volume of sales per store: sales for “Kagglazon” are significantly higher than those for the “Kaggle Store” and “Kaggle Learn” stores. 💡
After analysing the data through our visualisations in the previous sections, we can start preparing the dataset for the ML algorithms. This requires transforming the data into a tidy format, converting categorical features such as country, store and product into encoded columns.
df_train$country <- factor(df_train$country)
df_train$store <- factor(df_train$store)
df_train$product <- factor(df_train$product)
dt_train <- data.table(df_train)
dt_train <- one_hot(dt_train,cols = c("country","store","product"))
df_train <- as.data.frame(dt_train)
✅ All right! We have finally prepared our dataset. In the next step, we will split the data into training and testing subsets for building and evaluating our prediction model.
set.seed(101)
# split_idx avoids masking base R's sample() function
split_idx <- sample.split(df_train$num_sold, SplitRatio = 0.7)
train <- subset(df_train, split_idx == TRUE)
test <- subset(df_train, split_idx == FALSE)
Let us utilise the linear regression technique to predict the number of sold products.
model_lr <- lm(num_sold~.,data=train)
glance(model_lr)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.750 0.750 92.1 26107. 0 11 -569619. 1139264. 1.14e6
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
As we can observe,
💡 The linear regression model fared moderately while predicting the total number of sales, with an R-squared of 0.75, i.e. it explains about 75% of the variance in the target. 💡
model_aug <- augment(model_lr)
fitted.results <- predict(model_lr,newdata=subset(test,select=-(num_sold)))
After fitting the linear regression model on the train dataset and predicting values for the test dataset, let us see how the fitted and actual values compare.
df_lr <- as.data.frame(test$num_sold)
df_lr <- df_lr %>% rename("Actual_values" = "test$num_sold")
df_lr$fitted <- fitted.results
pl7 <- ggplot(data = df_lr, aes(x = Actual_values, y = fitted)) +
  geom_point() +
  geom_smooth(method = "lm", aes(color = "Linear regression prediction")) +
  theme_classic() +
  labs(x = "Actual values", y = "Predicted values", color = "Model") +
  ggtitle("Predicted and actual values \n in Linear Regression model") +
  theme(plot.title = element_text(hjust = 0.5))
pl7
Figure 6.1: Predicted and actual values in Linear Regression model
💡 we can observe that the linear regression model does not do a great job of predicting the number of sales. One reason is that linear regression is sensitive to outliers; another is that not every phenomenon can be accurately described by a linear model, and the current problem may be poorly described by one. 💡
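To complement the scatter plot with a numeric error estimate, one could compute the RMSE and MAE of the predictions on the held-out split. This is a sketch assuming the `test` data frame and `fitted.results` vector from the steps above:

```r
# Root mean squared error and mean absolute error of the
# linear model's predictions on the held-out test split.
rmse <- sqrt(mean((test$num_sold - fitted.results)^2))
mae  <- mean(abs(test$num_sold - fitted.results))
cat("RMSE:", round(rmse, 2), "MAE:", round(mae, 2), "\n")
```

These scale-dependent errors give a baseline that the tree-based models loaded earlier (xgboost, lightgbm, catboost) can later be compared against.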